Introduction:
Analysis was conducted on the lemurs dataset to
determine relationships that are not inherently apparent upon a
basic/elementary inspection. Because of the numerous columns within this
dataset, Principal Component Analysis (PCA) was performed to reduce
dimensionality while preserving the information provided. This dataset
provides information on 27 different types of lemurs, or taxa, collected
by the Duke Lemur Center (DLC). With 82,609 rows and 54 columns, this
dataset contains vast information regarding the lives of roughly 3,060
lemurs of each taxon.
While PCA was performed on the entire dataset, the main columns of
interest within this research document are the following numerical
columns: age_at_death_y, age_at_wt_mo_no_dec,
age_at_wt_y, age_max_live_or_dead_y,
avg_daily_wt_change_g, birth_month,
change_since_prev_wt_g, concep_month,
dam_age_at_concep_y, days_before_death,
days_since_prev_wt, expected gestation,
month_of_weight, litter size,
n_known_offspring, r_min_dam_age_at_concep_y,
sire_at_concep_y, and wight_g. The
non-numerical columns of interest are sex,
hybrid, age_category, and taxon.
These columns give insights on biological and physical features of each
lemur. Performing PCA on this dataset will allow for an increase in
readability and intrepretability while preserving as much information as
possible.
Details on the specific columns are shown below:
age_at_death_y: the age of the lemur at the time of
death
age_at_wt_mo_no_dec: the age of the lemur at the time
‘weight_g’ was recorded rounded down
age_at_wt_y: the age of the lemur at the time ‘weight_g’
was recorded in years
age_max_live_or_dead_y: the maximum age the animal could
have achieved as of the date the datafile was last updated
avg_daily_wt_change_g: the average change of weight for
a lemur each day
birth_month: the month of birth of the lemur
change_since_prev_wt_g: the weight change since previous
recorded weight in grams
concep_month: the month of conception of the lemur
dam_age_at_concep_y: the mother of the lemurs age at
conception
days_before_death: the number of days the weight of the
lemur was recorded before death
days_since_prev_wt: the number of days the initial
weight of the lemur was recorded since new weight was recorded
expected gestation: the expected gestation of the
lemur
litter size: the size of the little of the lemur
month_of_weight: the month that the lemurs weight was
recorded
n_known_offspring: the number of known offspring of the
lemur
r_min_dam_age_at_concep_y: the dam minimum age at
conception
sire_at_concep_y: the father of the lemurs age at
conception
wight_g: wight of the lemur in grams
The dataset analyzed within this research document,
lemurs, is below:
lemurs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv')
head(lemurs)
## # A tibble: 6 × 54
## taxon dlc_id hybrid sex name curre…¹ stud_…² dob birth…³ estim…⁴
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <date> <dbl> <chr>
## 1 OGG 0005 N M KANGA N <NA> 1961-08-25 8 <NA>
## 2 OGG 0005 N M KANGA N <NA> 1961-08-25 8 <NA>
## 3 OGG 0006 N F ROO N <NA> 1961-03-17 3 <NA>
## 4 OGG 0006 N F ROO N <NA> 1961-03-17 3 <NA>
## 5 OGG 0009 N M POOH BEAR N <NA> 1963-09-30 9 <NA>
## 6 OGG 0009 N M POOH BEAR N <NA> 1963-09-30 9 <NA>
## # … with 44 more variables: birth_type <chr>, birth_institution <chr>,
## # litter_size <dbl>, expected_gestation <dbl>, estimated_concep <date>,
## # concep_month <dbl>, dam_id <chr>, dam_name <chr>, dam_taxon <chr>,
## # dam_dob <date>, dam_age_at_concep_y <dbl>, sire_id <chr>, sire_name <chr>,
## # sire_taxon <chr>, sire_dob <date>, sire_age_at_concep_y <dbl>, dod <date>,
## # age_at_death_y <dbl>, age_of_living_y <dbl>, age_last_verified_y <dbl>,
## # age_max_live_or_dead_y <dbl>, n_known_offspring <dbl>, …
More information about the dataset can be found here: https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-08-24 and https://www.nature.com/articles/sdata201419.
Below shows the manipulation/engineering to the dataset:
#lemurs[mapply(is.infinite, lemurs)] <- NA #changing the infinite values to NA/NaN
lemur <- lemurs %>%
select(where(is.numeric) | c("sex","taxon","hybrid","age_category")) %>%
select(-c("age_at_wt_wk","age_at_wt_d","age_at_wt_mo","age_of_living_y","age_last_verified_y","expected_gestation_d","days_before_inf_birth_if_preg","pct_preg_remain_if_preg","infant_lit_sz_if_preg")) %>%
select_if(~ !all(is.na(.))) #selecting those that are not all NA/NaN
head(lemur)
## # A tibble: 6 × 22
## birth_month litter_s…¹ expec…² conce…³ dam_a…⁴ sire_…⁵ age_a…⁶ age_m…⁷ n_kno…⁸
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8 1 129 4 2.22 2.22 15.5 15.5 7
## 2 8 1 129 4 2.22 2.22 15.5 15.5 7
## 3 3 1 129 11 1.78 1.78 13.6 13.6 9
## 4 3 1 129 11 1.78 1.78 13.6 13.6 9
## 5 9 1 129 5 4.32 4.32 10.4 10.4 1
## 6 9 1 129 5 4.32 4.32 10.4 10.4 1
## # … with 13 more variables: weight_g <dbl>, month_of_weight <dbl>,
## # age_at_wt_mo_no_dec <dbl>, age_at_wt_y <dbl>, change_since_prev_wt_g <dbl>,
## # days_since_prev_wt <dbl>, avg_daily_wt_change_g <dbl>,
## # days_before_death <dbl>, r_min_dam_age_at_concep_y <dbl>, sex <chr>,
## # taxon <chr>, hybrid <chr>, age_category <chr>, and abbreviated variable
## # names ¹litter_size, ²expected_gestation, ³concep_month,
## # ⁴dam_age_at_concep_y, ⁵sire_age_at_concep_y, ⁶age_at_death_y, …
Approach:
To perform PCA on a dataset, pre-processing should take place.
Because the prcomp() function only works with numerical
values, a few columns were removed, this was done with the
select(where(is.numeric)) function.
Steps for PCA
1. Standardization:
Standardization(X) = (X(i) - mean(X)) / standard deviation(X)
This entails transforming the data to ensure that each variable within the dataset contributes equally to the analysis. Standardization ensures that there are no biases toward specific variables.
2. Covariance Matrix:
Covariance(X) = E [(X − E[X]) (X − E[X]).T]
Where X is the vector, E represents the expectation of the vector X, and .T depicts the transformation of a vector.
This step is essential in determining if there are any relationships between the variables. The correlation matrix is of square size of the length of the datasets numerical columns and each value corresponds to the relationship between the associated variables.
3. Identify Principal Components
The principal components are determined by computing the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors determine the direction and eigenvalues are the magnitude of variance each initial variable has with each principal component. The principal components are constructed as mixtures of the initial variables. The components act as new variables that the initial variables are uncorrelated with (i.e., perpendicular to). Finding the principal components of a dataset allows for better interpretation of the data and the relationships that may exist within.
4. Feature Vectors and Recasting
Feature vectors are necessary for dimensionality reduction. In this step, the eigenvectors and eigenvalues are taken and sorted from high to low significance. The eigenvalues that are of low magnitude have lower significance and can be discarded from the analysis. Recasting just entails graphing the initial variables onto the principal components shown below in Figure 3.
Below shows the computation of the principal components:
lemur_pca <- na.omit(lemur) %>%
select(where(is.numeric)) %>% # retain only numeric columns
scale() %>% # scale to zero mean and unit variance
prcomp() # performing principal component analysis
Below shows a summary of the principal component analysis performed:
summary(lemur_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9345 1.7815 1.3207 1.16583 1.12656 1.03264 0.99839
## Proportion of Variance 0.2079 0.1763 0.0969 0.07551 0.07051 0.05924 0.05538
## Cumulative Proportion 0.2079 0.3842 0.4811 0.55662 0.62712 0.68636 0.74174
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.96297 0.94022 0.82511 0.78650 0.71344 0.65474 0.59451
## Proportion of Variance 0.05152 0.04911 0.03782 0.03437 0.02828 0.02382 0.01964
## Cumulative Proportion 0.79326 0.84237 0.88019 0.91456 0.94284 0.96665 0.98629
## PC15 PC16 PC17 PC18
## Standard deviation 0.49682 0.002492 0.0003523 5.334e-14
## Proportion of Variance 0.01371 0.000000 0.0000000 0.000e+00
## Cumulative Proportion 1.00000 1.000000 1.0000000 1.000e+00
Analysis:
The below graphs are the initial variables graphed against the first and second principal components, colored by the sex, age_category, and taxon. The first and second principal components are the two principal components with the highest variance with the initial variables.
arg_lemur <- lemur_pca %>%
augment(na.omit(lemur))
sex_pca_plot <- ggplot(
arg_lemur,
aes(.fittedPC1, .fittedPC2, color = sex)
) +
geom_point(
alpha = 0.85
) +
ggtitle("Principal Components") +
labs( #adding labels
x = "Principal Component 1",
y = "Principal Component 2",
subtitle = "Figure 1"
) +
scale_color_brewer(
name = "Sex",
palette = "Set2", #custom palette
guide = guide_legend(nrow=2)
) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = "top",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank()
)
age_pca_plot <- ggplot(
arg_lemur,
aes(.fittedPC1, .fittedPC2, color = age_category)
) +
geom_point(
alpha = 0.85
) +
labs( #adding labels
x = "Principal Component 1",
y = "Principal Component 2"
) +
scale_color_brewer(
name = "Age Category",
labels = c("Adult","Infant or Juvinile","Young Adult"),
palette = "Accent", #custom palette
guide = guide_legend(nrow=3)
) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = "top",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank()
)
sex_pca_plot | age_pca_plot
arg_lemur <- lemur_pca %>%
augment(na.omit(lemur))
taxon_pca_plot1 <- ggplot(
arg_lemur,
aes(.fittedPC1, .fittedPC2,color = taxon) #assigning color as the sex
) +
geom_point(
alpha = .85
) +
ggtitle("Principal Components") +
labs( #adding labels
x = "Principal Component 1",
y = "Principal Component 2",
subtitle = "Figure 2",
) +
scale_color_manual(
name = "",
values = c("#a3df5a","#5a124b", "#d185f2", "#ada6e1","#8cfaf7","#07d4c0","#00f945","#f1f653","#ee8d00","#ff7c5b","#ca4732","#f886c0","#767293","#3f6a69","#055049","#039f2f","#8a8d34","#8d5504","#8e4836","#71291e","#ff32d3","#e098ff","#d8d3ff","#99fffd","#97ffb4","#1f73f5","#f9947a","#ff2200") , #custom palette
guide = guide_legend(nrow=14)
) +
#scale_color_viridis(discrete = TRUE) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = "none",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank()
)
taxon_pca_plot2 <- ggplot(
arg_lemur,
aes(.fittedPC8, .fittedPC9,color = taxon) #assigning color as the sex
) +
geom_point(
alpha = .85
) +
labs( #adding labels
x = "Principal Component 8",
y = "Principal Component 9",
) +
scale_color_manual(
name = "Taxon",
values = c("#a3df5a","#5a124b", "#d185f2", "#ada6e1","#8cfaf7","#07d4c0","#00f945","#f1f653","#ee8d00","#ff7c5b","#ca4732","#f886c0","#767293","#3f6a69","#055049","#039f2f","#8a8d34","#8d5504","#8e4836","#71291e","#ff32d3","#e098ff","#d8d3ff","#99fffd","#97ffb4","#1f73f5","#f9947a","#ff2200") , #custom palette
guide = guide_legend(nrow=14)
) +
#scale_color_viridis(discrete = TRUE) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = "right",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank(),
legend.text=element_text(size=8),
legend.spacing.x = unit(0.0, 'cm')
)
taxon_pca_plot1 | taxon_pca_plot2
The graph below shows the initial variables graphed against the first and second principal components. This time the variables are graphed by their eigenvectors and eigenvalues.
arrow_style <- arrow(
angle = 20, length = grid::unit(6, "pt"),
ends = "first", type = "closed"
)
lemur_pca %>%
# extract rotation matrix
tidy(matrix = "rotation") %>%
pivot_wider(
names_from = "PC", values_from = "value",
names_prefix = "PC"
) %>%
ggplot(aes(PC1, PC2,color = factor(column))) +
geom_segment(
xend = 0, yend = 0,
arrow = arrow_style,
size = 0.9, alpha = 0.9
) +
ggtitle("Principal Components") +
labs( #adding labels
x = "Principal Component 1",
y = "Principal Component 2",
color = "Variables",
subtitle = "Figure 3"
) +
scale_color_manual(
name = "",
values = c("#a3df5a","#5a124b", "#d185f2", "#ada6e1","#8cfaf7","#07d4c0","#00f945","#f1f653","#ee8d00","#ff7c5b","#ca4732","#f886c0","#767293","#3f6a69","#055049","#039f2f","#8a8d34","#8d5504","#8e4836","#71291e","#ff32d3","#e098ff","#d8d3ff","#99fffd","#97ffb4","#1f73f5","#f9947a","#ff2200") , #custom palette
guide = guide_legend(ncol=1)
) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = "right",
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank(),
legend.text=element_text(size=7),
legend.spacing.y = unit(0.0, 'cm')
)
Below shows a bar graph of the variance that each principal component has with the initial variables.
lemur_pca %>%
# extract eigenvalues
tidy(matrix = "eigenvalues") %>%
ggplot(aes(PC, percent, fill = "")) +
geom_col() +
geom_point() +
geom_line() +
scale_x_continuous(
breaks = 1:18 # create one axis tick per PC
) +
scale_y_continuous(
# format y axis ticks as percent values
label = scales::label_percent(accuracy = 1)
) +
ggtitle("Variance Explained by each Principal Component") +
labs( #adding labels
x = "Principal Component",
y = "Variance Explained",
subtitle = "Figure 4"
) +
scale_fill_brewer(
palette = "Paired" #custom palette
) +
theme_bw( #adding a theme for visualization
) +
theme( #aesthetics
legend.position = 'none',
axis.line = element_line(colour = "black"),
panel.border = element_blank(),
panel.background = element_blank()
)
Discussion:
Principal component analysis is an essential part of machine learning and data science because it allows for better interpretation of data. This analysis allows for dimensionality reduction without the loss of valuable information that the dataset may contain. Generally, PCA is used for identifying relationships in datasets where variables are strongly correlated. This analysis showed that the variables within this data are not strongly correlated.
The analysis done on the lemurs dataset is shows various
relationships. A depiction of the initial variables plotted against the
first two principal components is shown in Figure 1. The graphs colored
by both sex and age_category are shown for
grouping visualizations purposes. Figure 2 shows how when plotting
against two principal components that have high variance such as the
graph on the left (PC1 vs. PC2), the groups are profound. Whereas in the
right plot (PC8 vs. PC9) the groups are less extreme because the
variance of the principal components with the initial variables are
lower.
In Figure 3, age_at_wt_y and
age_max_live_or_dead_y show to have a strong positive
relationship with Principal Component 1 and
expected_gestation has a moderate negative relationship
with Principal Component 1. This component could be a potential measure
for time or number of years. The variables
expected_gestation,r_min_dam_age_at_concep_y,
dam_age_at_concep_y, sire_age_at_concep_y, and
weight_g all have a strong negative relationship with
Principal Component 2, whereas avg_daily_wt_change_g and
litter_size tend to have a strong positive relationship
with Principal Component 2. This component could measure fertility
factors. The days_since_prev_wt column seems to have a
minimal relationship with Principal Component 1 and Principal Component
2.
Finally, Figure 4 portrays the variance each principal component has with the initial variables. The higher variance in a principal component corresponds to higher dispersion with the initial variables, thus containing more information to draw from. Other cases for PCA include image processing and many classification tasks.
Disclaimer: A number of variables were
removed in this analysis for simplicity. These columns include:
age_at_wt_wk, age_at_wt_d,
age_at_wt_mo, age_of_living_y,
age_last_verified_y, expected_gestation_d,
days_before_inf_birth_if_preg,
pct_preg_remain_if_preg,
infant_lit_sz_if_preg. This does not compromise the
validity of the findings.